Symbolic Discriminant Analysis for Mining Gene Expression Patterns

Authors

  • Jason H. Moore
  • Joel S. Parker
  • Lance W. Hahn
Abstract

Linear discriminant analysis is a popular multivariate statistical approach for classifying observations into groups because the theory is well described and the method is easy to implement and interpret. However, an important limitation is that linear discriminant functions need to be pre-specified. That is, specific variables need to be selected and added linearly into the model, and only the coefficients are estimated from the data. To address this limitation, we developed symbolic discriminant analysis (SDA) for the automatic selection of gene expression variables and discriminant functions that can take any form. Our SDA approach is inspired by the symbolic regression approach of Koza (1992). We begin by defining the mathematical functions (e.g., +, -, /, *, log, sqrt) and the list of gene expression variables that could potentially be used as the building blocks for discriminant functions. Symbolic discriminant functions are evaluated by generating discriminant scores for each observation to be classified. The overlap in distributions of discriminant scores between groups is an estimate of the classification error. Class membership for new observations can be predicted from the discriminant score that separates the distributions. To identify optimal symbolic discriminant functions from the near-infinite model space, we employed parallel genetic programming for machine learning on 4 processors of a 110-processor Beowulf-style parallel supercomputer. We applied the SDA approach to identifying subsets of gene expression variables and symbolic discriminant functions that can correctly classify and predict types of human acute leukemia. Using a leave-one-out cross-validation strategy, we identified two different combinations of gene expression variables and symbolic discriminant functions that correctly classified 38/38 observations in the first dataset and correctly predicted 33/34 observations in the independent dataset. Genes identified in these two models included adipsin, erythroid beta-spectrin, nucleoporin 98, and CD33. These are all genes associated with leukemia. We conclude that the SDA approach provides a powerful alternative to traditional multivariate statistical methods for identifying gene expression patterns. The advantages of SDA include the ability to identify an important subset of gene expression variables from among thousands of candidates and the ability to identify the most appropriate mathematical functions relating the gene expression variables to a clinical endpoint. We anticipate this will be an important methodology to add to the repertoire of approaches for mining gene expression patterns.
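The scoring step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The expression-tree encoding, the gene names g1 to g3, the toy expression values, the protected division, and the choice of the midpoint between group mean scores as the separating threshold are all assumptions made for illustration; the abstract says only that a discriminant score separating the two distributions is used. The genetic programming search over candidate functions is omitted.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is zero (an assumed convention)."""
    return a / b if b != 0 else 1.0


# Hypothetical function set; the abstract also mentions log and sqrt.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, sample):
    """Compute the discriminant score of one observation.

    tree is a gene name (str), a numeric constant, or a tuple (op, left, right).
    sample maps gene names to expression values.
    """
    if isinstance(tree, str):
        return sample[tree]
    if isinstance(tree, (int, float)):
        return float(tree)
    op, left, right = tree
    return FUNCTIONS[op](evaluate(left, sample), evaluate(right, sample))


def classification_error(tree, samples, labels):
    """Count misclassified observations for a two-class problem (labels 0 and 1).

    Assumption: the threshold is the midpoint between the two group mean scores.
    """
    scores = [evaluate(tree, s) for s in samples]
    group0 = [sc for sc, y in zip(scores, labels) if y == 0]
    group1 = [sc for sc, y in zip(scores, labels) if y == 1]
    mean0 = sum(group0) / len(group0)
    mean1 = sum(group1) / len(group1)
    threshold = (mean0 + mean1) / 2
    # The class with the larger mean score is predicted above the threshold.
    high_class = 1 if mean1 > mean0 else 0
    predictions = [high_class if sc > threshold else 1 - high_class for sc in scores]
    errors = sum(p != y for p, y in zip(predictions, labels))
    return errors, threshold


# Toy data (invented): four observations, three hypothetical genes.
samples = [
    {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    {"g1": 2.0, "g2": 1.0, "g3": 4.0},
    {"g1": 3.0, "g2": 2.0, "g3": 1.0},
    {"g1": 2.0, "g2": 3.0, "g3": 2.0},
]
labels = [0, 0, 1, 1]

# Candidate symbolic discriminant function: g1 * g2 - g3
candidate = ("-", ("*", "g1", "g2"), "g3")
errors, threshold = classification_error(candidate, samples, labels)
print(errors, threshold)
```

In a full SDA run, an error count like this would serve as the fitness that a genetic programming search minimizes over many candidate trees, and leave-one-out cross-validation would repeat the fit with each observation held out in turn.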


Similar resources

Microarray analysis of gene expression patterns in Arabidopsis seedlings under trehalose, sucrose and sorbitol treatment

Trehalose is the non-reducing alpha,alpha-1,1-linked glucose disaccharide. The biosynthesis precursor of trehalose, trehalose-6-phosphate (T6P), is essential for plant development, growth, carbon utilization and alters photosynthetic capacity but its mode of action is not understood. In the current research, 6-day-old seedlings of Arabidopsis thaliana (Columbia ecotype) were grown in liquid cultu...


Facial expression recognition based on Local Binary Patterns

Classical LBP has problems such as complexity and high-dimensional feature vectors, which make it necessary to apply dimension reduction. In this paper, we introduce an improved LBP algorithm that addresses these problems by using the Fast PCA algorithm to reduce the dimensionality of the extracted feature vectors. In other words, the proposed method (Fast PCA+LBP) is an improved LBP algorithm that is extracted ...


GSTF1 Gene Expression Analysis in Cultivated Wheat Plants under Salinity and ABA Treatments

Most plants encounter stresses such as drought and salinity that adversely affect growth, development and crop productivity. The expression of the gene glutathione-s-transferases (GST) extends throughout various protective mechanisms in plants and allows them to adapt to unfavorable environmental conditions. GSTF1 (the first phi GSTFs class) gene expression patterns in the wheat cultivars Mahuti ...


O-30: Comparing Expression Patterns of Endometrial Genes in Implantation Failures and Recurrent Miscarriages with Fertile Couples Following ICSI/IVF Using in Silico Analysis

Background: To screen and diagnose patients with recurrent abortions and implantation failure after IVF/ICSI, differentially expressed genes of the endometrium were monitored using DNA microarrays. Materials and Methods: The microarray expression profile of the GSE26787 dataset from the GEO database was used to analyze gene expression profiles of 15 endometrial biopsy samples: five from control fertile (CF) ...


Prediction of Blasting Cost in Limestone Mines Using Gene Expression Programming Model and Artificial Neural Networks

The use of blasting cost (BC) prediction to achieve optimal fragmentation is necessary in order to control the adverse consequences of blasting such as fly rock, ground vibration, and air blast in open-pit mines. In this research work, BC is predicted through collecting 146 blasting data from six limestone mines in Iran using the artificial neural networks (ANNs), gene expression programming (G...




Publication date: 2001